Abstract

Offline software using TCP/IP sockets to distribute particle physics events to multiple UNIX/RISC 
workstations is described. A modular, building block approach was taken, which allowed tailoring to solve 
specific tasks efficiently and simply as they arose. The modest, initial cost was having to learn about sockets 
for interprocess communication. This multiprocessor management software has been used to control the 
reconstruction of eight billion raw data events from Fermilab Experiment E791. 

1. The E791 Reconstruction Task 

Fermilab Experiment E791 accumulated a large dataset (50 Terabytes, 20 billion events, 24000 8mm 
Exabyte tapes) in 1991 and early 1992 [1,2]. As might be expected, the reconstruction and analysis of this 
data challenged available computing resources; event reconstruction alone required over 10 000 mips-years 
of processing power. For 2½ years, reconstruction processing was underway at four different locations [3].
The three largest sites used clusters or farms of commercial UNIX/RISC workstations connected together 
by thin-wire Ethernet. Within each farm, many processors operated together; data management and system
control were exerted from a single point via multiprocessor management software. Here we describe the 
multiprocessor management software developed and used at the University of Mississippi [4]. The Mississippi
farm hardware is shown in Figs. 1 and 2. 

Most of the large-scale computing needs encountered in particle physics - reconstruction and analysis of 
recorded data and generation of simulated data - are event-oriented. Each event's data packet is extracted 
from an input stream and processed in isolation from other events. The results from each event's analysis 
are merged into an output stream. The computing power required to process each data packet is significant 
relative to the time needed to transport the data, even over datapaths of modest throughput. Many other 
scientific computing problems (e.g. small-object recognition in astronomical images) conform to the same 
model. In all such problems, it is trivial to divide the total computing task over many processors; each 
processor is simply given a share of the events to process independently of its peers. 

Most reconstruction and simulation tasks are developed and tested as single-processor programs. They 
can be outlined in the following manner: 

    initialize data structures
    initialize input and output data streams
    read run startup information from input data stream
    read and store run calibration constants
    for every event from the input stream...
        unpack and check the input data
        PROCESS THE EVENT (most of the work)
        for good events,
            pack the processed event into output stream
    produce a report characterizing the processing job
    stop

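The outline above can be sketched as a minimal single-processor skeleton in Python. The function names (`unpack`, `process_event`, `run_job`) and the toy filter are hypothetical placeholders, not E791 code; the sketch only illustrates keeping dataflow management separate from event processing.

```python
from typing import Iterable, Optional


def unpack(raw: bytes) -> Optional[list[int]]:
    """Unpack and check one raw event; return None if it is malformed."""
    if not raw:
        return None
    return list(raw)


def process_event(event: list[int], calibration: int) -> Optional[int]:
    """Stand-in for the real reconstruction: keep events that pass a toy filter."""
    total = sum(event) + calibration
    return total if total % 2 == 0 else None


def run_job(input_stream: Iterable[bytes], calibration: int) -> tuple[list[int], dict]:
    """Dataflow management: read events, process each in isolation, collect output."""
    report = {"read": 0, "bad": 0, "written": 0}
    output_stream: list[int] = []
    for raw in input_stream:
        report["read"] += 1
        event = unpack(raw)
        if event is None:
            report["bad"] += 1
            continue
        result = process_event(event, calibration)
        if result is not None:  # only good events reach the output stream
            output_stream.append(result)
            report["written"] += 1
    return output_stream, report
```

Because `process_event` never touches the streams, it could in principle be moved into a client process unchanged while `run_job` is split between server and client.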
When the program is "nearly working," it is then moved into the multiprocessing environment for final
testing and production. If it has been written in a sensible fashion, with dataflow management tasks cleanly 
separated from event processing tasks, then the transition from single processor to multiprocessor system 
should be relatively painless. Dataflow management is generally vested in a single server process, while the 
event processing is performed by numerous client processes running in many processors. The server grants 
the clients access to calibration data, reads the event input stream, parcels out events to the many client 
processors, gathers output data from the clients, and writes the output stream. The server processor is 
typically well endowed with peripheral devices (disks, tape drives), whereas the clients may be very simple 
(but powerful) processors with no peripherals except a network connection. 

2. The Multiprocessor Manager's Tasks 

In the first implementations of event-oriented multiprocessing for particle physics [5,6], the multiprocessor 
management software had to do a great deal of hard work. Clients frequently had no significant operating 
system, primarily because memory was expensive. Executable code and calibration data had to be formatted 
in the server and downloaded into the clients word by word. Extraction of input events from records and 
merging of output events into records often had to be done in the server because of limited client buffer 
memory, posing the danger of a processing bottleneck at the server. Reports had to be retrieved from the 
clients as tables of numbers, to be formatted and output by the server. Substantial effort was often required 
to move a single-processor program onto a multiprocessor system, because the division between server code 
and client code was intricate. Developing the multiprocessor management code could become nearly as large 
a task as developing the event processing software: an odious burden. 

With the advent of cheaper memory (enabling each client to have a real operating system and large event 
buffers) plus good networking software, most of the difficulties inherent in moving code to a multiprocessor 
vanished. Rather than relating "war stories" of heroic efforts and brilliant strategies leading to victory 
in the face of staggering difficulties, this paper celebrates the fact that multiprocessor management is now 
straightforward, that an effort tiny compared to the development of application code will suffice to distribute 
the workload efficiently and reliably among many processors, and that generic system software nearing 
completion promises to make the task even easier in the future. 

3. A Disk-Based Multiprocessor Manager 

It is easiest to understand the multiprocessor strategy by considering first a disk-based system. The 
clients each have a real operating system (in our case ULTRIX, a flavor of UNIX) and thus can read and 
write disk files, but they have no disks attached directly to them. Instead, clients have access over the 
network to disks attached to the server. Using Network File System (NFS) software, disks or portions of 
disks can be "cross-mounted" so that they are accessible by multiple processors. For us, this meant the 
server itself and all of its clients. NFS is supplied as standard software with many types of workstations, 
and is available as an option on most of the rest. 

The fact that clients, though diskless, have full access to disk services across the network immediately
solves several problems that burdened earlier systems:

(1) Clients can read executable code and start programs; it is no longer necessary for 
the server to micromanage the downloading and startup of clients, though high level 
control of client processes is still maintained in the server, 

(2) Clients can read their own calibration files; it is no longer necessary for the server to 
explicitly read, reformat, and transmit calibration data for the clients, 

(3) Clients can write their own report files; it is no longer necessary for the server to 
explicitly extract, format, and write reports for the clients. 

In many cases, cross-mounted disks can even be used to provide for the movement of event data (input 
and output) between server and clients: 
(4) The server can read events from tape and write them to disk; the clients can read the
events from disk and process them. For example, each client can own two input files; 
the server inserts events into one while the client consumes events from the other, 

(5) The clients can write output events to disk; the server can read the disk files, and write 
the events to output tapes. Each client can own two output files; the client inserts 
processed events into one while the server removes events from the other and writes 
them to tape. 
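The two-file scheme in items (4) and (5) can be sketched as follows for the input side. This is an illustration only: the slot file names and the handoff rule (a slot file exists only while it is full, and is published by renaming a temporary file) are our assumptions, since the text does not say how ownership of a file was passed between server and client.

```python
import os
from pathlib import Path
from typing import Optional

SLOTS = ("input_a.dat", "input_b.dat")  # hypothetical names for a client's two input files


def server_fill(client_dir: Path, events: bytes) -> Optional[str]:
    """Write a block of events into the first empty slot; None if both slots are full."""
    for name in SLOTS:
        slot = client_dir / name
        if not slot.exists():
            tmp = client_dir / (name + ".tmp")
            tmp.write_bytes(events)
            os.replace(tmp, slot)  # the slot appears under its final name only when complete
            return name
    return None


def client_consume(client_dir: Path) -> Optional[bytes]:
    """Read and release one full slot; None if the client has to wait for input."""
    for name in SLOTS:
        slot = client_dir / name
        if slot.exists():
            data = slot.read_bytes()
            slot.unlink()  # hand the empty slot back to the server
            return data
    return None
```

The sketch does not preserve the order of blocks, which is acceptable here because each event is processed independently of the others.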

Referring back to the outline of the processing tasks, it is clear that the program running in each client 
of a multiprocessor system is very similar to the single-processor program. Each client loads its own code 
and starts it. Each client reads its own calibration constants from disk. Each client writes its own reports to 
disk. Each client reads its input data from disk and writes its output data to disk, just as single-processor 
code usually does during program development. The server's tasks now become the following: 

(1) Keeping a list of the available processors and requesting them to start instances of the 
processing program, 

(2) Making startup data (e.g. run number) available to clients in a disk file, 

(3) Assuring that valid calibration constants are available to be read by the clients, 

(4) Reading input data from tape and distributing events to the various client disk files, 

(5) Fetching output data from the various client disk files and writing it to tape, 

(6) Gathering reports from the client report files and producing (where appropriate) a 
system-wide report. 

This scheme was very simple to implement, maintain, and document. For applications in which dataflow 
is slow, it works very well. Clients are provided with certified events (or blocks of events) on a reliable 
medium, protected from the vagaries of tape reading, which is handled by the server. Likewise, clients have 
a reliable medium for writing output; problems with tape writing are handled in the server. If enough disk 
is available, whole tape files can be staged through the disk files, leading to more reliable performance from 
streaming tape drives [7].

4. A TCP/IP Sockets Based Multiprocessor Manager 

As a farm's data throughput is increased by adding processors, upgrading processors, or making application
code run faster, passing the event stream through disk can become a bottleneck; clients may spend too
much time waiting for input events to be read and output events to be written, and disk "thrashing" (incessant
head motion) may set in. One possible solution is to move the event streams through cross-mounted
memory disks using special software by which a portion of the server's memory is set aside and appears to
the user as a very high-speed, low-latency, non-thrashing disk drive. When we explored this option in 1992,
such software was available but did not perform well; it is possible that suitable products now exist. Memory
disks preserve the advantage that the application code for a multiprocessor system looks almost identical to
that for a single processor; all data passing is done with normal Fortran reads and writes.

Instead, our multiprocessor manager bypasses disk altogether for those portions of the dataflow that are
fast enough to challenge the disk's throughput; data is moved between processes using TCP/IP (Transmission
Control Protocol/Internet Protocol) network services [8]. Using this facility, processes can establish a connection
between themselves and pass data back and forth by read-from-connection and write-through-connection
subroutine calls. Event input/output can no longer be implemented as simple Fortran reads and writes in
the client (a disadvantage), but on the other hand, the high throughput of direct network data transfers
becomes available.

Before deciding to implement a TCP/IP-based scheme, we had to answer three significant questions: 
(1) Would passing data through TCP/IP yield a significant improvement in throughput
over NFS disk-based data transfers? 

(2) Would resulting modifications to the client code be tractable? 

(3) Given very limited time to produce the management code, would TCP/IP be easy
enough to learn and use, especially for those of us coming from a VAX/VMS Fortran
environment?

We wrote a test package to shuttle data between several processors to measure throughput and reliability. 
The results: average data throughputs in excess of 900 kilobytes per second (almost full Ethernet speed) could 
be maintained (we needed only 150 kilobytes per second); the impact on client efficiency was immeasurably 
low at maximum projected throughput; and data corruption was not seen. 
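As a rough check of these figures, the short calculation below compares the measured and required rates with the nominal capacity of the link. It assumes the standard 10 Mbit/s signalling rate of thin-wire Ethernet and 1 kilobyte = 1000 bytes; neither value is stated in the text.

```python
ETHERNET_BITS_PER_S = 10_000_000  # assumption: nominal 10 Mbit/s thin-wire Ethernet
MEASURED_KB_PER_S = 900           # sustained throughput reported for the test package
REQUIRED_KB_PER_S = 150           # throughput the reconstruction job needed


def link_budget() -> dict:
    """Compare measured and required throughput with the nominal link capacity."""
    raw_kb_per_s = ETHERNET_BITS_PER_S / 8 / 1000
    return {
        "raw_kb_per_s": raw_kb_per_s,
        "measured_fraction": MEASURED_KB_PER_S / raw_kb_per_s,
        "required_fraction": REQUIRED_KB_PER_S / raw_kb_per_s,
        "headroom": MEASURED_KB_PER_S / REQUIRED_KB_PER_S,
    }
```

Under these assumptions the measured rate is about 72% of the raw bit rate (before any allowance for protocol overhead), and the farm needed only one sixth of what the test achieved.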

In keeping with the Fortran orientation of the experiment's software, Fortran-callable functions were 
written in C for all of the system services needed to support TCP/IP data transfers. As a result, the 
changes to the client code were minimal and easily understandable to Fortran-only programmers. 

Fig. 3 illustrates how the network I/O calls are used. As the server prepares to start a client, it uses
makesocket to "have a phone put in", so that it will be able to connect to the client. When the client starts,
it too uses makesocket to "have a phone put in". The server "lists its number" by binding its socket to
a port (bindsocket), and "stays near the phone" listening for an attempt to connect (listensocket). The
client "calls up" the server (connectsocket) and the server "picks up the phone", establishing the connection
(acceptsocket).
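The telephone analogy maps directly onto the Berkeley sockets calls. The sketch below uses Python's standard `socket` module rather than the Fortran-callable wrappers described here, whose argument lists are not given in the text; the function names, the loopback address, and running the client in a thread instead of on another workstation are all illustrative assumptions.

```python
import socket
import threading


def start_server(host: str = "127.0.0.1") -> socket.socket:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # "have a phone put in"
    srv.bind((host, 0))  # "list its number"; port 0 lets the OS pick a free port
    srv.listen(1)        # "stay near the phone"
    return srv


def client_call(address: tuple[str, int], greeting: bytes) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:  # the client's phone
        cli.connect(address)  # "call up" the server
        cli.sendall(greeting)


def connection_demo() -> bytes:
    srv = start_server()
    caller = threading.Thread(target=client_call, args=(srv.getsockname(), b"hello"))
    caller.start()              # stands in for starting a client on another machine
    conn, _peer = srv.accept()  # "pick up the phone"
    with conn:
        data = b""
        while True:             # read until the client hangs up
            chunk = conn.recv(1024)
            if not chunk:
                break
            data += chunk
    caller.join()
    srv.close()
    return data
```

Unlike the sequence in the text, this sketch calls `listen` before the client is started; either order works as long as the server is listening by the time the client tries to connect.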

When it needs input data, the client "places its order" by writing a message to the server (writesocket).
The server is continually monitoring all of the client connections for requests (selectsocket). When a request
comes in, the server "writes down the order" (readsocket), and does its best to satisfy the client's request.
The client and server shuttle messages back and forth (each writing to the other and reading from the other)
until the input data is exhausted. At that point, the server notifies the clients that there is no more data.
The clients then finish their tasks, close their connections (closesocket), and exit; the server finishes its tasks
and exits.
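The request loop just described can be sketched with `select` in Python. The message format is hypothetical: a one-byte request from the client, answered by a four-byte length followed by the event, with a length of zero meaning "no more data" (so events must be non-empty). Connected socket pairs and threads stand in for real network connections and client workstations.

```python
import select
import socket
import struct
import threading


def client(sock, results):
    """Ask for events until the server says there are no more."""
    with sock, sock.makefile("rb") as rfile:
        while True:
            sock.sendall(b"R")  # "place an order"
            (length,) = struct.unpack("!I", rfile.read(4))
            if length == 0:     # no-more-data indication
                break
            results.append(rfile.read(length))  # stand-in for processing the event


def serve(events, conns):
    """Hand out events to whichever clients are asking, then dismiss them."""
    pending = list(events)
    active = list(conns)
    while active:
        ready, _, _ = select.select(active, [], [])  # which clients need service?
        for conn in ready:
            if not conn.recv(1):  # connection closed by the client
                active.remove(conn)
                conn.close()
            elif pending:
                payload = pending.pop(0)
                conn.sendall(struct.pack("!I", len(payload)) + payload)
            else:
                conn.sendall(struct.pack("!I", 0))  # tell this client to finish
                active.remove(conn)
                conn.close()


def farm_demo(events, n_clients=3):
    results = []
    pairs = [socket.socketpair() for _ in range(n_clients)]
    workers = [threading.Thread(target=client, args=(c, results)) for _, c in pairs]
    for w in workers:
        w.start()
    serve(events, [s for s, _ in pairs])
    for w in workers:
        w.join()
    return results
```

Which client receives which event depends on timing, so only the set of processed events is predictable, not their order.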

The Fortran-callable routines that manipulate sockets and connections really are that simple to use. Only
one routine (readsocket) is any more than a C to Fortran interface, and even readsocket is trivial. Almost
all of the real work involved with network communications has already been done in UNIX, TCP/IP, and
Berkeley Sockets.
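The text does not say what extra work readsocket does. One plausible reason, offered here only as an assumption, is that a read on a TCP connection may return fewer bytes than requested, so a caller that wants a whole record must loop. A minimal Python sketch of such a loop (the name `read_exact` is ours):

```python
import socket
import threading
import time


def read_exact(sock: socket.socket, nbytes: int) -> bytes:
    """Keep reading until exactly nbytes have arrived or the peer closes."""
    buf = bytearray()
    while len(buf) < nbytes:
        chunk = sock.recv(nbytes - len(buf))  # may return fewer bytes than requested
        if not chunk:
            raise ConnectionError("connection closed before the full record arrived")
        buf += chunk
    return bytes(buf)


def read_demo():
    a, b = socket.socketpair()

    def sender():
        for piece in (b"abc", b"de", b"fgh"):  # deliberately not record-aligned
            b.sendall(piece)
            time.sleep(0.01)
        b.close()

    t = threading.Thread(target=sender)
    t.start()
    with a:
        first = read_exact(a, 4)
        second = read_exact(a, 4)
    t.join()
    return first, second
```

Here the sender writes pieces of 3, 2, and 3 bytes, yet the reader still recovers two complete 4-byte records.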

5. Performance 

Only the transport of input events from server to clients was implemented in TCP/IP; output events 
from clients to server needed to be staged to disk to make best use of our Exabyte tape drives, so they were 
written to disk through NFS as before. This output scheme was efficient because five out of six events were 
rejected by a filter after reconstruction and were not output. 

Tape reading was multiply buffered, so that events were almost always available immediately when a 
client requested them. We used a simple trick to ensure that clients were not suffering from delays in 
receiving new events. Whenever input data was available, the server checked the clients to see which ones 
were asking for new input. The server always checked the clients in the same order, so processors at the 
start of the list had priority over processors near the end of the list. If clients were "spinning," i.e. waiting 
for events, this was reflected in anomalously low throughput in the least favored processors. By this and 
several additional measures, it appears that more than 97% of the client cycles were being put to beneficial 
use (actually processing events) whenever the farm was running at all. 
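The fixed-order polling diagnostic can be illustrated with a toy simulation. The model is ours, not a description of the farm software: time advances in ticks, the server receives a fixed number of events per tick, and every client needs the same number of ticks per event.

```python
def simulate(n_clients: int, events_per_tick: int, work_ticks: int, n_ticks: int) -> list[int]:
    """Return the number of events each client processed under fixed-order polling."""
    busy_until = [0] * n_clients  # tick at which each client next asks for input
    processed = [0] * n_clients
    for tick in range(n_ticks):
        available = events_per_tick
        for i in range(n_clients):  # clients are always checked in the same order
            if available == 0:
                break
            if busy_until[i] <= tick:  # client i is waiting for an event
                busy_until[i] = tick + work_ticks
                processed[i] += 1
                available -= 1
    return processed
```

With ample input, `simulate(4, 4, 2, 100)` gives every client the same count, `[50, 50, 50, 50]`. When the input stream can feed only half the farm, `simulate(4, 1, 2, 100)` gives `[50, 50, 0, 0]`: the shortfall lands entirely on the clients at the end of the list, which is the signature the text describes.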

Funding awarded in June 1993 allowed an expansion of the Mississippi computing facility from 1100 to 
2900 mips. By July 1993, the increased computing power had been acquired and was processing data. E791 
reconstruction was completed in September 1994. A total of eight billion events on 10 000 raw data tapes 
were processed at the University of Mississippi. Preliminary versions of the reconstruction software were
run to find out which E791 algorithms would yield the most physics per tape. One result was a tripling 
in the yield of charm particles. When the final reconstruction software was ready, it was run. Overall 
efficiency, considering all cycles lost for any reason, exceeded 90%. Thus the multiprocessor manager has 
proven itself to be efficient and robust under varying conditions in what we believe to be a fairly typical 
university operating environment. 

For the E791 experiment as a whole, the successful management of multiple processors has provided 
the full reconstruction of over 200 000 particles with a charm quark. This large charm sample is in turn 
generating new physics results [9-15]. 

6. Scavenging Computing Cycles 

Although most of the processors in the Mississippi system are mounted in racks at a central location, 
some are at people's desks and serve as general purpose workstations. Many workstation activities - editing, 
compiling and running small programs, reading and writing e-mail, etc. - can coexist with the farm client 
process running in the background. However, there are some workstation activities that are incompatible 
with farm operations, and the client process must be removed from the workstation. Nevertheless, it is 
certainly helpful to be able to scavenge the desk workstation cycles when they are otherwise unused. 

There are several approaches to using the desk workstations as farm clients. On one extreme, the 
workstation can be removed from the farm client list; it will never be used as a farm client. At the other 
extreme, one can forbid workstation activities that interfere with the farm operation. In between there is 
ample middle ground. For example, it is possible to have the farm server examine the workstation activities 
from time to time and adjust the priority or run status of the farm client process accordingly. 

At Mississippi, we have found it satisfactory to allow users to abruptly kill the client process whenever 
they find its activities on their workstations to be troublesome. The server is quickly aware that the client 
has disappeared and adjusts its event distribution accordingly. If a workstation user knows that he will be 
engaged in a computationally intensive activity for a long time, he may edit the client list and remove his 
workstation from the farm. Once its client is killed, a workstation cannot serve as a farm client again until the
next job is started. Though crude, this scheme has proven highly effective in our operating environment. 
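How the server notices a vanished client is not spelled out in the text. With Berkeley sockets the usual signal is that the connection becomes readable and a read returns no data (or a reset error), because the operating system closes the sockets of a killed process. A sketch under that assumption, with a hypothetical helper name:

```python
import select
import socket


def drop_dead_clients(conns: list) -> list:
    """Return the connections still alive; close those whose peer has gone away."""
    alive = []
    ready, _, _ = select.select(conns, [], [], 0)  # poll without blocking
    for conn in conns:
        if conn in ready:
            try:
                peeked = conn.recv(1, socket.MSG_PEEK)  # look without consuming a request
            except ConnectionError:
                peeked = b""
            if not peeked:  # end of stream: the client process is gone
                conn.close()
                continue
        alive.append(conn)
    return alive
```

A connection that is readable because a request is waiting is left untouched, so the request can still be read and served afterwards.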

One disadvantage of this approach is that a few input events may be "trapped" in a killed processor, and 
thus lost. We take a rather cavalier attitude toward such losses; with twenty billion events, we can afford to 
lose a few now and again. Although in principle the server can hold event buffers until successful processing 
is assured, and reassign the events to another processor if they become trapped in a disabled client, we don't 
do that. In E791, processing raw events is rather like hauling corn to market in a truck. If a few grains of 
corn fall out of the truck, no one becomes concerned until the loss becomes large enough to be economically 
important. The alternative approach - treating events as babies in a hospital nursery, where one normally 
expects a somewhat stricter accounting - only makes sense when the events have become greatly enriched 
in scientific significance, late in the reconstruction and analysis cycle. 

7. Multiprocessor Operations 

The most vexing operational problems are those that one might expect in handling a dataset this large - 
ensuring that all tapes are processed exactly once, preparing and maintaining run-dependent calibration 
files, making sure that all of the output tapes are correctly labeled, preparing and examining the necessary 
report files, etc. During the first year of operation, job flow was controlled by scripts composed with the 
help of small interactive programs. In 1993, a farm job manager with an X Window graphical user interface 
was written. It was specifically targeted at preventing errors we had observed to occur in the setup of jobs 
and the management of tapes. Most of the day-to-day system operation in later stages was performed by 
students, many of whom had little understanding of the internal workings of the system. 

Farm management software consists of three independent programs: 

(1) The server code, which provides system and dataflow control, 
(2) The client code, linked to the application code routines,

(3) The tape-writing code, running on the server, which fetches events written to disk by 
clients and transfers them to tape. 

The server, the tape writer, and each client produce report files for each job. The tape writer (the last
program to handle the data) gathers these reports together and produces two farm-wide reports. One is a 
standard statistics file common to all E791 computing sites; the other is a report tailored to the needs of the 
Mississippi site. 

8. Future Directions 

Event-parallel computing farms have come to dominate large-scale computations in experimental particle 
physics. Because computing paradigms seldom remain viable for more than a decade or so in this field, it is 
perhaps useful to ask "Whither computer farming?" in the next few years. 

Two trends in the computer industry make it unlikely that "farming as usual" will continue much longer. 
First, individual workstations, PCs, and Macintoshes now coming to market offer such prodigious computing 
power that each processor can exploit the full I/O bandwidth of its data storage peripherals; in that case, 
there seems little point in concentrating dataflow through a server, although one might still imagine a single 
locus of system control for several computers, each directly attached to its own peripheral devices [16]. 

Second, with the increased popularity of object-oriented design and languages such as C++ which support 
it, the data structures now being explicitly passed between processes will be implemented as objects of 
classes. Very soon it will be possible to define remote objects, which will tie together the resources of many 
processors within a single programming environment. For programs written within such a paradigm, there 
will be almost no difference between a single processor implementation and a multiprocessor implementation 
except for listing the computing resources that may be brought to bear on the task. 

In the early days of "computer farming" in particle physics, there were no suitable commercial processors
available, so we built our own [5,6]. After a short time, we were put out of the processor building business 
by high-powered workstations offered by several vendors [17]. The early implementations of multiprocessor 
management software were complex, costly, and cumbersome to use; the need for them has been snuffed out 
by the widespread availability of interprocess communication tools such as TCP/IP sockets and NFS. The 
simple streamlined multiprocessor management toolkits remain and are likely to be in use for a bit longer, 
but their end is also in sight. For tasks with mammoth computational needs and modest dataflow rates, truly 
transparent, vendor-independent, object-oriented access to the combined power of dozens or hundreds of
inexpensive powerful processors appears to be imminent; "computer farm management" as we now practice
it is an idea whose time has come and nearly gone.
Fig. 1. Computing farm configuration at the University of Mississippi. Servers and clients are DECstation
5000 workstations running ULTRIX. Some have MIPS R3000 Processors; others have the more powerful 
MIPS R4000. Altogether, there are 68 processors organized into four farms, each with a separate job 
stream. One typical farm is shown in this diagram. The two input tape drives alternate. The output is 
staged through disk to tape. This I/O scheme avoids the need for continuous operator supervision. The 
total computing power is about 3000 mips. 
Fig. 2. A photographic overview of the University of Mississippi computing farm. Servers are on the four
tables. Clients are on the racks shown as well as on desktops which are not shown. The espresso machine is 
on the left. 
[Figure 3: flowchart of server and client socket calls; only partly legible in this copy.
Server: MAKE_SOCKET to connect with the clients; BIND_SOCKET to a port so clients can find it; for every
client, START a client, LISTEN_SOCKET to listen for a call from the client, then ACCEPT_SOCKET; for every
input record, SELECT_SOCKET to see which client needs service, READ_SOCKET to obtain the request message,
WRITE_SOCKET to send data length info or no more data indication; CLOSE_SOCKET; WRITE REPORT and EXIT.
Client: MAKE_SOCKET to connect with the server; WRITE_SOCKET to ask for data; READ_SOCKET to read the data
length and the data, until the end of data; write the output to disk; CLOSE_SOCKET to make sure the
connection is broken; WRITE REPORT and EXIT.]

Fig. 3. Communication between server and clients using TCP/IP sockets. 